My journal entry of Assignment 3 can be found here.
Please have all the files and directory in my A3 directory in GitHub repo in the same directory of the Rmd file in order to complie the Rmd file. The file strcture should look like below:
The data is from GSE178521 (ZHENG et al., 2022). This data set have 6 samples, 3 replicates for the control with GFP knockdown (GFPkd1, GFPkd2, GFPkd3) and 3 replicates for the test condition with PRMT5 knockdown (PRMT5kd1, PRMT5kd2, PRMT5kd3). The author aimed to figure out the function of PRMT5 in cancer cells.
I cleaned the data by removing the genes with 0 count for at least 3 samples. We got 15159 genes left. I did not removed any outliers.
I normalized the data using Trimmed Mean of M-values (TMM) normalization method in the edgeR (Robinson, McCarthy, & Smyth, 2010) package.
I then performed the differential gene expression analysis with edgeR where the conditions (PRMT5kd or GFPkd) was the only factor that I used in the design. The Benjamni-Hochberg (BH) correction (Benjamini & Hochberg, 1995) was used.
I performed the thresholded over-representation analysis using g:Profiler (Raudvere et al., 2019). I didn’t get the terms such as “negative regulation of T cells” and “type I INFγ response” which was shown to be enriched for the differentially expressed genes in the PRMT5 knockdown cells (PRMT5kd) (ZHENG et al., 2022). The reason is that the important genes related to these terms, such as Arg2 and CD274, didn’t pass the p-value threshold < 0.05.
To perform GSEA (Subramanian et al., 2005), I needed to create the ranked set of genes. I got the top hits from edgeR differential gene expression analysis, where in total there are 15159 genes. I calculate the rank with the formula -log10(p-value) * sign(logFC).
library(knitr)
# load the edgeR result from A2
qlf_output_hits <- readRDS("./qlf_output_hits.rds")
# added rank = -log10(p-value) * sign(logFC) into the dataframe
qlf_output_hits[,"rank"] <- -log(qlf_output_hits$PValue, base = 10) * sign(qlf_output_hits$logFC)
# order in decreasing order of rank
qlf_output_hits <- qlf_output_hits[order(qlf_output_hits$rank, decreasing = TRUE),]
# make the ranked list
ranked_genes <- data.frame(GeneName = rownames(qlf_output_hits), rank = qlf_output_hits[,"rank"])
write.table(ranked_genes, "./all_ranked_genes.rnk", quote = FALSE, row.names = FALSE, sep = "\t")
kable(ranked_genes[c(1:5, 15155:15159),], type="pipe")
| GeneName | rank | |
|---|---|---|
| 1 | ADAM19 | 5.525852 |
| 2 | DDIT3 | 5.383518 |
| 3 | FGG | 5.160587 |
| 4 | OLFML2A | 4.704579 |
| 5 | LIFR | 4.479153 |
| 15155 | HSP90AA1 | -4.157425 |
| 15156 | FRMD8 | -4.468215 |
| 15157 | PRMT5 | -6.327536 |
| 15158 | HYPK | -6.346021 |
| 15159 | IL1R2 | -6.667572 |
I used the baderlab geneset collection (Merico, Isserlin, Stueker, Emili, & Bader, 2010) from March 1, 2021 containing GO biological process, no IEA and pathways (Human_GOBP_AllPathways_no_GO_iea_March_01_2021_symbol.gmt) as our geneset database.
I ran GSEAPreranked with the following parameters (GSEA v4.2.3):
Many top terms for na_pos phenotype (up-regulated genes in PRMT5kd cells) are related to collagen and coagulation and wound healing, which is not seen in thresholded analysis results for up-regulated genes. Whereas in thresholded analysis results, the top terms are related to cell development. Since the paper mentions two terms “negative regulation of T cells” and “type I INFγ response” (ZHENG et al., 2022), I specifically looked for those two terms. In the thresholded analysis results, neither of them appears, while both terms appears in the GSEA results even though they are with a poor p-value. In A2, I found that the genes such as such as Arg2 and CD274, didn’t pass the p-value threshold < 0.05, so those genes are filtered out in thresholded analysis, that’s why the two terms can’t be found there. However, here we didn’t filter out those genes based on their p-value, so we would get the two terms in the results but not significantly.
NEGATIVE REGULATION OF T CELL ACTIVATION%GOBP%GO:0050868
TYPE I INTERFERON SIGNALING PATHWAY%GOBP%GO:0060337
For na_neg phenotype (down-regulated genes in PRMT5kd cells), I saw many terms related to basic biological processes, as seen in the thresholded analysis results.
The comparsion of the threshould analysis and non-thresholded analysis is not that straight forward to me since they used different annotation data source (except GO:BP). However, from the results I can still see that those non-significant genes play some roles on the analysis.
I used Enrichment Map App and set FDR q-value cutoff to 0.01 to visulize the GSEA results.
There are 372 nodes and 6303 edges in the resulting map. The node cutoff is 0.01 and edge cutoff is 0.375.
Figure 1: Cytoscape settings for the intial enrichment map
Figure 2: Enrichment Map for my GSEA results
I used the AutoAnnotate App in Cytoscape to add annotation to my results with the following steps:
Figure 3: Result after annotation using AutoAnnotate
I made the publication ready figure by checking the “Publication-Ready” in EnrichmentMap panel.
Figure 4: Publication ready figure for our network
Figure 5: Legend of the enrichment map
I tried to create a theme network by collapsing all clusters in AutoAnnotate, but I could only get the figure below, without any words.
However, I would still looked at Figure 4 to find some major themes. The major themes includes apc g1 degradation, coupled transport electron, and decay nonsense targeting. I can’t really fit those themes to the model, and I don’t think there are any novel pathway or theme.
Figure 6: Failed theme network plot
I added a signature analysis to the network using the nutraceutical drugs from DrugBank. I downloaded the annotation file from baderlab (Merico et al., 2010) from March 1, 2021 collection.
I used the Mann-Whitney (two-sided) test with a cutoff of 0.05.
Below are the terms that overlap:
Figure 7: Overlapped gene sets from nutraceutical drugs
The nutraceutical gene sets are mainly overlapped with the down-regulated gene sets in our dataset. I searched the NADH in DrugBank and NADH always serves as an electron carrier, which makes sense for the large overlap between the nutraceutical gene sets and the “coupled transport electron” which is down-regulated in the PRMT5kd cells.
There are also papers and clinical trials shows the use of nutraceuticals in the treatment of lung cancers.
Figure 8: Enrichment Map with the nutraceutical gene sets